{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "hdr01",
   "metadata": {},
   "source": [
    "# Save Output Files\n",
    "\n",
    "Assembles the final deliverable files in `../Output/`:\n",
    "1. Copy reference CSVs from `../Input/` (ACS groups, US election file)\n",
    "2. Copy AL county election file from `Process/` (generated by notebook 5)\n",
    "3. Pivot and enrich vehicle-count files\n",
    "3b. Filter and copy AL zip-code election file (to ACS/Vehicle footprint)\n",
    "4. Save feature-selection results (filtered to cutoff sub-groups)\n",
    "5. Copy the selected-subgroups dictionary from `Process/`\n",
    "6. Compute derived features and save ACS Enhanced data files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hdr02",
   "metadata": {},
   "source": [
    "## 1. Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cfg01",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input:  C:\\Users\\myavuz\\OneDrive - The University of Alabama\\Work\\SharedCode\\ScientificData_Descriptor_EV\\Revision0\\Input\n",
      "Output: C:\\Users\\myavuz\\OneDrive - The University of Alabama\\Work\\SharedCode\\ScientificData_Descriptor_EV\\Revision0\\Output\n"
     ]
    }
   ],
   "source": [
    "import shutil\n",
    "from pathlib import Path\n",
    "import polars as pl\n",
    "\n",
    "INPUT_DIR  = Path(\"../Input\")\n",
    "OUTPUT_DIR = Path(\"../Output\")\n",
    "OUTPUT_DIR.mkdir(exist_ok=True)\n",
    "\n",
    "# Canonical county names — map known spelling variants to a single form\n",
    "COUNTY_NAME_MAP = {\n",
    "    \"StClair\": \"St. Clair\",\n",
    "}\n",
    "\n",
    "def normalize_county(name: str) -> str:\n",
    "    \"\"\"Strip ' County' suffix and apply spelling fixes.\"\"\"\n",
    "    clean = name.replace(\" County\", \"\")\n",
    "    return COUNTY_NAME_MAP.get(clean, clean)\n",
    "\n",
    "print(f\"Input:  {INPUT_DIR.resolve()}\")\n",
    "print(f\"Output: {OUTPUT_DIR.resolve()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hdr03",
   "metadata": {},
   "source": [
    "## 2a. Copy reference CSVs from Input\n",
    "\n",
    "Copy the ACS group list and US-level election file from `../Input/`.\n",
    "Vehicle-count files are handled in section 3.\n",
    "AL-level election files are generated by notebook 5 and copied from\n",
    "`Process/` in section 2b."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "copy01",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ACSGroups_Initial.csv  ->  Output/ACSGroups_Initial.csv\n",
      "  GeneralElectionVotes_USStates.csv  ->  Output/GeneralElectionVotes_USStates.csv\n",
      "\n",
      "Copied 2 files from Input/\n"
     ]
    }
   ],
   "source": [
    "SKIP_PREFIXES = [\"VehicleCountsByEVAFuelType_\"]\n",
    "SKIP_FROM_INPUT = [\n",
    "    \"GeneralElectionVotes_ALCounties.csv\",\n",
    "    \"GeneralElectionVotes_ALZipCodes.csv\",\n",
    "    \"2024AlRegVehicleCountsByCountyPrecinctZipCode.csv\",\n",
    "    \"PrecinctVoteCounts.csv\",\n",
    "]\n",
    "\n",
    "input_csvs = sorted(INPUT_DIR.glob(\"*.csv\"))\n",
    "copied = 0\n",
    "for src in input_csvs:\n",
    "    if any(src.name.startswith(p) for p in SKIP_PREFIXES):\n",
    "        continue\n",
    "    if src.name in SKIP_FROM_INPUT:\n",
    "        continue\n",
    "\n",
    "    shutil.copy2(src, OUTPUT_DIR / src.name)\n",
    "    print(f\"  {src.name}  ->  Output/{src.name}\")\n",
    "    copied += 1\n",
    "\n",
    "print(f\"\\nCopied {copied} files from Input/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8mfkzz4bsm",
   "metadata": {},
   "source": [
    "## 2b. Copy AL county election file from Process/\n",
    "\n",
    "The county file is generated by notebook 5 (`5-DistributeElectionVotes`).\n",
    "County names are normalized via `normalize_county()`.\n",
    "\n",
    "The zip-code election file is also generated by notebook 5 but is filtered\n",
    "to the ACS/Vehicle zip-code set in section 3b (after section 3 determines\n",
    "the valid locations)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "fgfd1gqm7s",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  GeneralElectionVotes_ALCounties.csv  ->  Output/  (county names normalized)\n"
     ]
    }
   ],
   "source": [
    "# --- County file: normalize county names ---\n",
    "county_elec = pl.read_csv(\"GeneralElectionVotes_ALCounties.csv\")\n",
    "loc_col = county_elec.columns[0]\n",
    "county_elec = (\n",
    "    county_elec.with_columns(\n",
    "        pl.col(loc_col).map_elements(normalize_county, return_dtype=pl.Utf8)\n",
    "    )\n",
    "    .sort(loc_col)\n",
    ")\n",
    "county_elec.write_csv(OUTPUT_DIR / \"GeneralElectionVotes_ALCounties.csv\")\n",
    "print(f\"  GeneralElectionVotes_ALCounties.csv  ->  Output/  (county names normalized)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hdr_veh",
   "metadata": {},
   "source": [
    "## 3. Pivot and enrich vehicle-count files\n",
    "\n",
    "For each geographic level the raw `VehicleCountsByEVAFuelType_*.csv` is in\n",
    "long format (one row per location × fuel type). Pivot to wide format with\n",
    "fuel-type columns in order GDV, HEV, PHEV, BEV (and HFCV when present),\n",
    "then add derived columns:\n",
    "- **Total Vehicle Count** — sum of all fuel-type counts\n",
    "- **BEV Proportion** — BEV Count / Total Vehicle Count\n",
    "- **Population** — `B01003_E001` from the matching ACS file\n",
    "- **BEV per Capita** — BEV Count / Population\n",
    "\n",
    "Level-specific filters:\n",
    "- **ALZipCodes**: exclude zip codes missing from either ACS or vehicle data,\n",
    "  or with population = 0 or total vehicle count = 0\n",
    "- **USStates**: remove Puerto Rico"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "veh01",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ALZipCodes: 653 -> 635 rows (dropped 18 with zero population or vehicles)\n",
      "  Saved Output/VehicleCountsByEVAFuelType_ALZipCodes.csv  (635 rows, 9 cols)\n",
      "  Saved Output/VehicleCountsByEVAFuelType_ALCounties.csv  (67 rows, 9 cols)\n",
      "  USStates: removed Puerto Rico (51 -> 51 rows)\n",
      "  Saved Output/VehicleCountsByEVAFuelType_USStates.csv  (51 rows, 10 cols)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "FUEL_ORDER = [\"GDV\", \"HEV\", \"PHEV\", \"BEV\", \"HFCV\"]\n",
    "\n",
    "LEVELS = [\n",
    "    # (suffix, vehicle location col, ACS file, location col in ACS, output location col)\n",
    "    (\"ALZipCodes\",  \"ZipCode\", \"ACS_ALZipCodes.csv\",  \"Location\", \"Zip Code\"),\n",
    "    (\"ALCounties\",  \"County\",  \"ACS_ALCounties.csv\",  \"Location\", \"County\"),\n",
    "    (\"USStates\",    \"State\",   \"ACS_USStates.csv\",    \"Location\", \"State\"),\n",
    "]\n",
    "\n",
    "acs_frames = {}  # cache ACS data for section 6\n",
    "\n",
    "for suffix, veh_loc_col, acs_file, acs_loc_col, out_loc_col in LEVELS:\n",
    "    # --- Load vehicle data and pivot ---\n",
    "    veh = pl.read_csv(INPUT_DIR / f\"VehicleCountsByEVAFuelType_{suffix}.csv\")\n",
    "    fuel_types = [f for f in FUEL_ORDER if f in veh[\"EVAFuelType\"].unique().to_list()]\n",
    "\n",
    "    wide = (\n",
    "        veh.pivot(on=\"EVAFuelType\", index=veh_loc_col, values=\"Count\")\n",
    "        .select([veh_loc_col] + fuel_types)\n",
    "        .fill_null(0)\n",
    "        .rename({f: f\"{f} Count\" for f in fuel_types})\n",
    "    )\n",
    "\n",
    "    count_cols = [f\"{f} Count\" for f in fuel_types]\n",
    "    wide = wide.with_columns(\n",
    "        pl.sum_horizontal(count_cols).alias(\"Total Vehicle Count\")\n",
    "    )\n",
    "    wide = wide.with_columns(\n",
    "        (pl.col(\"BEV Count\") / pl.col(\"Total Vehicle Count\")).alias(\"BEV Proportion\")\n",
    "    )\n",
    "\n",
    "    # --- Load ACS and extract population ---\n",
    "    acs = pl.read_csv(acs_file)\n",
    "\n",
    "    # Align location columns for join\n",
    "    if suffix == \"ALCounties\":\n",
    "        # ACS has \"Autauga County\", vehicle has \"Autauga\" — normalize both\n",
    "        acs = acs.with_columns(\n",
    "            pl.col(acs_loc_col)\n",
    "            .map_elements(normalize_county, return_dtype=pl.Utf8)\n",
    "            .alias(\"_join_loc\")\n",
    "        )\n",
    "        wide = wide.rename({veh_loc_col: \"_join_loc\"})\n",
    "    elif suffix == \"ALZipCodes\":\n",
    "        acs = acs.with_columns(\n",
    "            pl.col(acs_loc_col).cast(pl.Int64).alias(\"_join_loc\")\n",
    "        )\n",
    "        wide = wide.rename({veh_loc_col: \"_join_loc\"})\n",
    "    else:\n",
    "        acs = acs.rename({acs_loc_col: \"_join_loc\"})\n",
    "        wide = wide.rename({veh_loc_col: \"_join_loc\"})\n",
    "\n",
    "    # Join population\n",
    "    pop = acs.select([\"_join_loc\", \"B01003_E001\"]).rename({\"B01003_E001\": \"Population\"})\n",
    "    wide = wide.join(pop, on=\"_join_loc\", how=\"inner\")\n",
    "\n",
    "    # --- Level-specific filters ---\n",
    "    if suffix == \"ALZipCodes\":\n",
    "        before = wide.height\n",
    "        wide = wide.filter(\n",
    "            (pl.col(\"Population\") > 0) & (pl.col(\"Total Vehicle Count\") > 0)\n",
    "        )\n",
    "        print(f\"  ALZipCodes: {before} -> {wide.height} rows \"\n",
    "              f\"(dropped {before - wide.height} with zero population or vehicles)\")\n",
    "\n",
    "    if suffix == \"USStates\":\n",
    "        before = wide.height\n",
    "        wide = wide.filter(pl.col(\"_join_loc\") != \"Puerto Rico\")\n",
    "        print(f\"  USStates: removed Puerto Rico ({before} -> {wide.height} rows)\")\n",
    "\n",
    "    # Add BEV per Capita\n",
    "    wide = wide.with_columns(\n",
    "        (pl.col(\"BEV Count\") / pl.col(\"Population\")).alias(\"BEV per Capita\")\n",
    "    )\n",
    "\n",
    "    # Rename join col to standardized output name and sort\n",
    "    wide = wide.rename({\"_join_loc\": out_loc_col}).sort(out_loc_col)\n",
    "\n",
    "    # Save\n",
    "    out_name = f\"VehicleCountsByEVAFuelType_{suffix}.csv\"\n",
    "    wide.write_csv(OUTPUT_DIR / out_name)\n",
    "    print(f\"  Saved Output/{out_name}  ({wide.height} rows, {wide.width} cols)\")\n",
    "\n",
    "    # Cache ACS with the same location filter for section 6\n",
    "    valid_locations = set(wide[out_loc_col].to_list())\n",
    "    if suffix == \"ALCounties\":\n",
    "        # Replace original Location with normalized names\n",
    "        acs_frames[suffix] = (\n",
    "            acs.filter(pl.col(\"_join_loc\").is_in(valid_locations))\n",
    "            .drop(acs_loc_col)\n",
    "            .rename({\"_join_loc\": out_loc_col})\n",
    "        )\n",
    "    elif suffix == \"ALZipCodes\":\n",
    "        acs_frames[suffix] = (\n",
    "            acs.filter(pl.col(\"_join_loc\").is_in(valid_locations))\n",
    "            .drop(\"_join_loc\")\n",
    "            .rename({acs_loc_col: out_loc_col})\n",
    "        )\n",
    "    else:\n",
    "        acs_frames[suffix] = (\n",
    "            acs.filter(pl.col(\"_join_loc\").is_in(valid_locations))\n",
    "            .rename({\"_join_loc\": out_loc_col})\n",
    "        )\n",
    "\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cl7voioimq7",
   "metadata": {},
   "source": [
    "## 3b. Filter and copy AL zip-code election file\n",
    "\n",
    "The election zip-code file produced by notebook 5 covers all zip codes\n",
    "reachable through the vehicle-registration-to-precinct mapping (759 zips).\n",
    "Many of these have no ACS population or no vehicle records and carry\n",
    "negligible vote counts (~0.07% of total votes).  Filter to the 635 zip\n",
    "codes retained by the ACS/Vehicle pipeline (section 3) so that all three\n",
    "data sources share an identical geographic footprint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3cp7h5fbhya",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  GeneralElectionVotes_ALZipCodes.csv  ->  Output/\n",
      "    635 -> 635 rows (dropped 0 zip codes not in ACS/Vehicle)\n",
      "    Lost votes: 0 / 2,265,060 (0.0000%)\n"
     ]
    }
   ],
   "source": [
    "# Load full zip-code election file from Process/\n",
    "zip_elec = pl.read_csv(\"GeneralElectionVotes_ALZipCodes.csv\")\n",
    "before_count = zip_elec.height\n",
    "\n",
    "# Valid zip codes from section 3 (ACS/Vehicle intersection, non-zero pop & vehicles)\n",
    "valid_zips = set(acs_frames[\"ALZipCodes\"][\"Zip Code\"].to_list())\n",
    "\n",
    "# Filter to valid zips\n",
    "zip_col = zip_elec.columns[0]  # \"Zip Code\"\n",
    "zip_elec = zip_elec.filter(pl.col(zip_col).is_in(valid_zips)).sort(zip_col)\n",
    "after_count = zip_elec.height\n",
    "dropped = before_count - after_count\n",
    "\n",
    "# Quantify lost votes\n",
    "def parse_vote_col(s: pl.Series) -> pl.Series:\n",
    "    return s.cast(pl.Utf8).str.replace_all(\",\", \"\").cast(pl.Int64)\n",
    "\n",
    "full_elec = pl.read_csv(\"GeneralElectionVotes_ALZipCodes.csv\")\n",
    "total_all = parse_vote_col(full_elec[\"Total Popular Vote\"]).sum()\n",
    "kept_total = parse_vote_col(zip_elec[\"Total Popular Vote\"]).sum()\n",
    "lost_votes = total_all - kept_total\n",
    "lost_pct = lost_votes / total_all * 100\n",
    "\n",
    "zip_elec.write_csv(OUTPUT_DIR / \"GeneralElectionVotes_ALZipCodes.csv\")\n",
    "print(f\"  GeneralElectionVotes_ALZipCodes.csv  ->  Output/\")\n",
    "print(f\"    {before_count} -> {after_count} rows (dropped {dropped} zip codes not in ACS/Vehicle)\")\n",
    "print(f\"    Lost votes: {lost_votes:,} / {total_all:,} ({lost_pct:.4f}%)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hdr_fs",
   "metadata": {},
   "source": [
    "## 4. Save feature-selection results\n",
    "\n",
    "Filter `Results_micro_individual.csv` to the sub-groups selected at\n",
    "the cutoff, rename and reorder columns, and save to\n",
    "`Output/FeatureSelection.csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fs01",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Selected sub-groups at cutoff: 97\n",
      "Saved FeatureSelection.csv  (137 rows)\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (137, 5)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>Dependent Variable</th><th>Order</th><th>SubGroupID</th><th>NumFeatures</th><th>Adjusted_R2</th></tr><tr><td>str</td><td>i64</td><td>str</td><td>i64</td><td>f64</td></tr></thead><tbody><tr><td>&quot;BEV Count&quot;</td><td>1</td><td>&quot;B15002_3&quot;</td><td>16</td><td>0.944761</td></tr><tr><td>&quot;BEV Count&quot;</td><td>2</td><td>&quot;B25104_2&quot;</td><td>32</td><td>0.960681</td></tr><tr><td>&quot;BEV Count&quot;</td><td>3</td><td>&quot;B06009_4_C&quot;</td><td>36</td><td>0.967735</td></tr><tr><td>&quot;BEV Count&quot;</td><td>4</td><td>&quot;B25040_2&quot;</td><td>45</td><td>0.971215</td></tr><tr><td>&quot;BEV Count&quot;</td><td>5</td><td>&quot;C24050_2&quot;</td><td>63</td><td>0.973886</td></tr><tr><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td></tr><tr><td>&quot;BEV per Capita&quot;</td><td>44</td><td>&quot;B06012_6_P1&quot;</td><td>273</td><td>0.932551</td></tr><tr><td>&quot;BEV per Capita&quot;</td><td>70</td><td>&quot;C24050_4&quot;</td><td>443</td><td>0.979629</td></tr><tr><td>&quot;BEV per Capita&quot;</td><td>78</td><td>&quot;B19101_2_C_P1&quot;</td><td>490</td><td>0.991358</td></tr><tr><td>&quot;BEV per Capita&quot;</td><td>79</td><td>&quot;B08006_10&quot;</td><td>496</td><td>0.992723</td></tr><tr><td>&quot;BEV per Capita&quot;</td><td>84</td><td>&quot;B08017_6_P2&quot;</td><td>517</td><td>0.999047</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (137, 5)\n",
       "┌────────────────────┬───────┬───────────────┬─────────────┬─────────────┐\n",
       "│ Dependent Variable ┆ Order ┆ SubGroupID    ┆ NumFeatures ┆ Adjusted_R2 │\n",
       "│ ---                ┆ ---   ┆ ---           ┆ ---         ┆ ---         │\n",
       "│ str                ┆ i64   ┆ str           ┆ i64         ┆ f64         │\n",
       "╞════════════════════╪═══════╪═══════════════╪═════════════╪═════════════╡\n",
       "│ BEV Count          ┆ 1     ┆ B15002_3      ┆ 16          ┆ 0.944761    │\n",
       "│ BEV Count          ┆ 2     ┆ B25104_2      ┆ 32          ┆ 0.960681    │\n",
       "│ BEV Count          ┆ 3     ┆ B06009_4_C    ┆ 36          ┆ 0.967735    │\n",
       "│ BEV Count          ┆ 4     ┆ B25040_2      ┆ 45          ┆ 0.971215    │\n",
       "│ BEV Count          ┆ 5     ┆ C24050_2      ┆ 63          ┆ 0.973886    │\n",
       "│ …                  ┆ …     ┆ …             ┆ …           ┆ …           │\n",
       "│ BEV per Capita     ┆ 44    ┆ B06012_6_P1   ┆ 273         ┆ 0.932551    │\n",
       "│ BEV per Capita     ┆ 70    ┆ C24050_4      ┆ 443         ┆ 0.979629    │\n",
       "│ BEV per Capita     ┆ 78    ┆ B19101_2_C_P1 ┆ 490         ┆ 0.991358    │\n",
       "│ BEV per Capita     ┆ 79    ┆ B08006_10     ┆ 496         ┆ 0.992723    │\n",
       "│ BEV per Capita     ┆ 84    ┆ B08017_6_P2   ┆ 517         ┆ 0.999047    │\n",
       "└────────────────────┴───────┴───────────────┴─────────────┴─────────────┘"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the cutoff sub-group list\n",
    "selected_sg = set(\n",
    "    pl.read_csv(\"SelectedSubGroups_Cutoff40.csv\")[\"SubGroup\"].to_list()\n",
    ")\n",
    "print(f\"Selected sub-groups at cutoff: {len(selected_sg)}\")\n",
    "\n",
    "# Load and filter micro-individual results\n",
    "results = (\n",
    "    pl.read_csv(\"Results_micro_individual.csv\")\n",
    "    .filter(pl.col(\"Group Added\").is_in(selected_sg))\n",
    "    .rename({\n",
    "        \"Response\": \"Dependent Variable\",\n",
    "        \"Step\": \"Order\",\n",
    "        \"Group Added\": \"SubGroupID\",\n",
    "        \"Cumul. Predictors\": \"NumFeatures\",\n",
    "        \"Adj R\\u00b2\": \"Adjusted_R2\",\n",
    "    })\n",
    "    .select([\"Dependent Variable\", \"Order\", \"SubGroupID\", \"NumFeatures\", \"Adjusted_R2\"])\n",
    "    .sort(\"Dependent Variable\", \"Order\")\n",
    ")\n",
    "\n",
    "out_path = OUTPUT_DIR / \"FeatureSelection.csv\"\n",
    "results.write_csv(out_path)\n",
    "print(f\"Saved {out_path.name}  ({results.height} rows)\\n\")\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hdr04",
   "metadata": {},
   "source": [
    "## 5. Copy selected-subgroups dictionary\n",
    "\n",
    "`ACSEnhanced_Dictionary_SelectedSubGroups.csv` contains the dictionary\n",
    "rows for the sub-groups chosen by the stepwise regression at the cutoff."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "copy02",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ACSEnhanced_Dictionary_SelectedSubGroups.csv  ->  Output/  (11 cols, dropped 13)\n"
     ]
    }
   ],
   "source": [
    "DROP_COLS = [\n",
    "    \"Label_0\", \"Label_1\", \"Label_2\", \"Label_3\", \"Label_4\", \"Label_5\", \"Label_6\",\n",
    "    \"Parent\", \"Parent^2\", \"Parent^3\", \"Parent^4\",\n",
    "    \"Partition\", \"Cumulable\",\n",
    "]\n",
    "\n",
    "dict_df_out = pl.read_csv(\"ACSEnhanced_Dictionary_SelectedSubGroups.csv\")\n",
    "dict_df_out = dict_df_out.drop([c for c in DROP_COLS if c in dict_df_out.columns])\n",
    "dict_df_out.write_csv(OUTPUT_DIR / \"ACSEnhanced_Dictionary_SelectedSubGroups.csv\")\n",
    "print(f\"ACSEnhanced_Dictionary_SelectedSubGroups.csv  ->  Output/  ({dict_df_out.width} cols, dropped {len(DROP_COLS)})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hdr05",
   "metadata": {},
   "source": [
    "## 6. Compute derived features and save ACS Enhanced data\n",
    "\n",
    "For each geographic level, compute derived features (cumulative sums and\n",
    "proportional ratios) from the base ACS variables using formulas in the\n",
    "selected-subgroups dictionary.  Save the result as `ACSEnhanced_*.csv`\n",
    "with all 688 selected feature columns plus location identifier and\n",
    "geographic metadata (`LandArea`, `PopulationDensity`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "filt01",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Selected features: 642 total (244 base, 398 derived)\n",
      "\n",
      "ACSEnhanced_ALZipCodes.csv\n",
      "  643 cols (642 features), 635 rows\n",
      "  Saved to Output/ACSEnhanced_ALZipCodes.csv\n",
      "\n",
      "ACSEnhanced_ALCounties.csv\n",
      "  643 cols (642 features), 67 rows\n",
      "  Saved to Output/ACSEnhanced_ALCounties.csv\n",
      "\n",
      "ACSEnhanced_USStates.csv\n",
      "  643 cols (642 features), 51 rows\n",
      "  Saved to Output/ACSEnhanced_USStates.csv\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# Load the selected-subgroups dictionary\n",
    "dict_df = pl.read_csv(\"ACSEnhanced_Dictionary_SelectedSubGroups.csv\")\n",
    "all_vars = dict_df[\"Variable\"].to_list()\n",
    "base_vars = [v for v in all_vars if \"(\" not in v and \"/\" not in v]\n",
    "derived_vars = [v for v in all_vars if \"(\" in v or \"/\" in v]\n",
    "print(f\"Selected features: {len(all_vars)} total \"\n",
    "      f\"({len(base_vars)} base, {len(derived_vars)} derived)\")\n",
    "\n",
    "# --- Formula parsing (same logic as notebook 3) ---\n",
    "\n",
    "def parse_sum(term: str) -> pl.Expr:\n",
    "    \"\"\"Parse a '+'-separated sum of variable names into a Polars expression.\"\"\"\n",
    "    parts = [t.strip() for t in term.split(\"+\")]\n",
    "    expr = pl.col(parts[0])\n",
    "    for p in parts[1:]:\n",
    "        expr = expr + pl.col(p)\n",
    "    return expr\n",
    "\n",
    "def formula_to_expr(formula: str) -> pl.Expr:\n",
    "    \"\"\"Convert a derived-variable formula string into a Polars expression.\"\"\"\n",
    "    clean = re.sub(r\"[()]\", \"\", formula.strip())\n",
    "    if \"/\" in clean:\n",
    "        numerator, denominator = clean.split(\"/\", 1)\n",
    "        return (parse_sum(numerator) / parse_sum(denominator)).fill_nan(0.0)\n",
    "    else:\n",
    "        return parse_sum(clean)\n",
    "\n",
    "def formula_refs(formula: str) -> list[str]:\n",
    "    \"\"\"Extract all base variable references from a formula string.\"\"\"\n",
    "    return re.findall(r\"[A-Z][A-Za-z0-9_]+_E\\d+\", formula)\n",
    "\n",
    "# --- Build output for each geographic level ---\n",
    "\n",
    "SUFFIX_TO_LOC_COL = {\n",
    "    \"ALZipCodes\": \"Zip Code\",\n",
    "    \"ALCounties\": \"County\",\n",
    "    \"USStates\":   \"State\",\n",
    "}\n",
    "\n",
    "LEVEL_TO_SUFFIX = {\n",
    "    \"ACSEnhanced_ALZipCodes.csv\":  \"ALZipCodes\",\n",
    "    \"ACSEnhanced_ALCounties.csv\":  \"ALCounties\",\n",
    "    \"ACSEnhanced_USStates.csv\":    \"USStates\",\n",
    "}\n",
    "\n",
    "for fname, suffix in LEVEL_TO_SUFFIX.items():\n",
    "    df = acs_frames[suffix]\n",
    "    loc_col = SUFFIX_TO_LOC_COL[suffix]\n",
    "    available = set(df.columns)\n",
    "\n",
    "    # Compute derived columns\n",
    "    new_exprs = []\n",
    "    skipped = 0\n",
    "    for var in derived_vars:\n",
    "        refs = formula_refs(var)\n",
    "        if not refs or not all(r in available for r in refs):\n",
    "            skipped += 1\n",
    "            continue\n",
    "        new_exprs.append(formula_to_expr(var).alias(var))\n",
    "\n",
    "    df = df.with_columns(new_exprs)\n",
    "\n",
    "    # Select location + all 688 features only\n",
    "    keep = [loc_col] + [v for v in all_vars if v in df.columns]\n",
    "\n",
    "    filtered = df.select(keep).sort(loc_col)\n",
    "    out_path = OUTPUT_DIR / fname\n",
    "    filtered.write_csv(out_path)\n",
    "\n",
    "    n_features = filtered.width - 1  # exclude location col\n",
    "\n",
    "    print(f\"\\n{fname}\")\n",
    "    print(f\"  {filtered.width} cols ({n_features} features), {filtered.height} rows\")\n",
    "    if skipped:\n",
    "        print(f\"  Skipped {skipped} derived vars (missing base columns)\")\n",
    "    print(f\"  Saved to Output/{fname}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hdr06",
   "metadata": {},
   "source": [
    "## 7. Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "sum01",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== Output directory ===\n",
      "  ACSEnhanced_ALCounties.csv                            527.0 KB\n",
      "  ACSEnhanced_ALZipCodes.csv                           3880.0 KB\n",
      "  ACSEnhanced_Dictionary_SelectedSubGroups.csv          146.1 KB\n",
      "  ACSEnhanced_USStates.csv                              469.1 KB\n",
      "  ACSGroups_Initial.csv                                   0.4 KB\n",
      "  FeatureSelection.csv                                    5.4 KB\n",
      "  GeneralElectionVotes_ALCounties.csv                     5.1 KB\n",
      "  GeneralElectionVotes_ALZipCodes.csv                    39.6 KB\n",
      "  GeneralElectionVotes_USStates.csv                       3.5 KB\n",
      "  VehicleCountsByEVAFuelType_ALCounties.csv               5.5 KB\n",
      "  VehicleCountsByEVAFuelType_ALZipCodes.csv              41.8 KB\n",
      "  VehicleCountsByEVAFuelType_USStates.csv                 5.0 KB\n"
     ]
    }
   ],
   "source": [
    "print(\"=== Output directory ===\")\n",
    "for f in sorted(OUTPUT_DIR.glob(\"*.csv\")):\n",
    "    size_kb = f.stat().st_size / 1024\n",
    "    print(f\"  {f.name:50s} {size_kb:8.1f} KB\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
